Bootstrapping Information Extraction from Field Books
Abstract
We present two machine learning approaches to information extraction from semi-structured documents that can be used if no annotated training data are available, but there does exist a database filled with information derived from the type of documents to be processed. One approach employs standard supervised learning for information extraction by artificially constructing labelled training data from the contents of the database. The second approach combines unsupervised Hidden Markov modelling with language models. Empirical evaluation of both systems suggests that it is possible to bootstrap a field segmenter from a database alone. The combination of Hidden Markov and language modelling was found to perform best at this task.
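The first approach, constructing labelled training data from database contents, might be sketched roughly as follows. This is only an illustrative assumption about how such data could be assembled, not the authors' actual procedure: the function `build_training_example`, the field names, the sample records, and the choice to shuffle field order are all hypothetical.

```python
import random

def build_training_example(record, rng):
    """Concatenate the field values of one database record (in shuffled
    order, an assumption) and label every token with its source field."""
    fields = list(record.items())
    rng.shuffle(fields)
    tokens, labels = [], []
    for field_name, value in fields:
        for token in value.split():
            tokens.append(token)
            labels.append(field_name)
    return tokens, labels

# Hypothetical records; field names and values are invented for illustration.
database = [
    {"species": "Rana temporaria", "location": "near the river bank", "date": "12 May 1932"},
    {"species": "Bufo bufo", "location": "forest edge", "date": "3 June 1933"},
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
training_data = [build_training_example(record, rng) for record in database]

for tokens, labels in training_data:
    print(list(zip(tokens, labels)))
```

The resulting token and label sequences could then be passed to any standard supervised sequence labeller, which would learn to segment unannotated entries into fields.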
Similar resources
Multi-Field Information Extraction and Cross-Document Fusion
In this paper, we examine the task of extracting a set of biographic facts about target individuals from a collection of Web pages. We automatically annotate training text with positive and negative examples of fact extractions and train Rote, Naïve Bayes, and Conditional Random Field extraction models for fact extraction from individual Web pages. We then propose and evaluate methods for fusin...
Bootstrapping an Ontology-based Information Extraction System
Automatic intelligent web exploration will benefit from shallow information extraction techniques if the latter can be brought to work within many different domains. The major bottleneck for this, however, lies in the so far difficult and expensive modeling of lexical knowledge, extraction rules, and an ontology that together define the information extraction system. In this paper we present a ...
Research on Domain-independent Opinion Target Extraction
Opinion Target Extraction is one of the important tasks for text sentiment analysis, which has attracted much attention from many researchers. For this task, we proposed an M-Score algorithm utilized in the model which realized the domain-independent opinion target extraction function. This algorithm is derived from the Pointwise Mutual Information algorithm, but the difference is that it doesn...
Boemie: Bootstrapping Ontology Evolution with Multimedia Information Extraction
The BOEMIE project proposes a bootstrapping approach to knowledge acquisition, which uses multimedia ontologies for fused extraction of semantics from multiple modalities, and feeds back the extracted information, aiming to automate the ontology evolution process.
Ontology-Based Information Extraction under a Bootstrapping Approach
The authors present an ontology-based information extraction process, which operates in a bootstrapping framework. The novelty of this approach lies in the continuous semantics extraction from textual content in order to evolve the underlying ontology, while the evolved ontology enhances in turn the information extraction mechanism. This process was implemented in the context of the R&D project...